AD73511 


DECEMBER  1971 




TECHNICAL  BULLETIN  STB  72-8 


A  COMPARISON  OF  FOUR  METHODS  OF  SELECTING 
ITEMS  FOR  COMPUTER  ASSISTED  TESTING 


Rebecca  Bryson 


Reproduced  by 

NATIONAL  TECHNICAL 
INFORMATION  SERVICE 

Springfield, Va. 22151


APPROVED  FOR  PUBLIC  RELEASE; 
DISTRIBUTION  UNLIMITED. 




UNCLASSIFIED 


DOCUMENT CONTROL DATA - R & D

(Security classification of title, body of abstract and indexing annotation must be entered when the overall report is classified)

1. ORIGINATING ACTIVITY (Corporate author)
Naval Personnel and Training Research Laboratory

2a. REPORT SECURITY CLASSIFICATION
UNCLASSIFIED

3. REPORT TITLE
A COMPARISON OF FOUR METHODS OF SELECTING
ITEMS FOR COMPUTER ASSISTED TESTING

4. DESCRIPTIVE NOTES (Type of report and inclusive dates)

5. AUTHOR(S) (First name, middle initial, last name)
Rebecca Bryson

6. REPORT DATE
December 1971

7a. TOTAL NO. OF PAGES    7b. NO. OF REFS

9a. ORIGINATOR'S REPORT NUMBER(S)

10. DISTRIBUTION STATEMENT
Approved for public release; distribution unlimited.

12. SPONSORING MILITARY ACTIVITY
Chief of Naval Personnel (Pers-A3)
Navy Department
Washington, D. C. 20370

13. ABSTRACT

Four methods were used to select items for shortened versions of the Navy General
Classification Test (GCT) and the Navy Mechanical Aptitude Test (MECH). Two of these
methods were used to produce short linear tests, and two to produce short branching
tests. Item response data banks were used for simulated administration of all short
tests. Obtained scores were then correlated with full length test scores. Navy
recruits' item responses were used throughout. Additional recruit samples were used for
administering short linear tests in paper-and-pencil form and short branching tests via
computer terminal. Correlations with previously administered full length tests were
obtained.

It was found that (1) When item response data banks were used for simulated item
administration, one approach requiring branching (Wolfe's BRANCH method) appeared
superior to the other three approaches for developing short GCT and MECH tests which
paralleled the full length tests. (2) When short linear tests were administered in
paper-and-pencil form and short branching tests were administered via computer
terminals results were less clear-cut. For producing results which parallel long test
score, one linear approach (Moonan's SEQUIN) appeared as good as and, in some
comparisons, superior to the two branching approaches. (3) Mode of item administration
(paper and pencil vs. computer) appeared to have an effect on test score. It is at
least possible that computer terminal testing would result in a loss of predictive
efficiency when tests are used to predict external criteria.

A LABORATORY OF THE BUREAU OF NAVAL PERSONNEL


SUMMARY 


Problem 


If  tests  are  to  be  shortened  for  administration  by  computer  ter¬ 
minal,  it  is  desirable  that  scores  on  the  shortened  tests  parallel 
scores  obtained  on  the  original  full  length  tests.  The  present  study 
was  designed  to  evaluate  several  methods  for  constructing  shortened 
tests.  Additionally,  the  study  investigated  the  possible  advantages 
of  computer  administration  of  tests  tailored  to  each  subject  as 
opposed  to  standard  linear  presentation  of  tests  of  the  same  length. 

Approach 

Four methods were used to select items for shortened versions of
the Navy General Classification Test (GCT) and the Navy Mechanical
Aptitude Test (MECH). Two of these methods were used to produce short
linear tests, and two to produce short branching tests. Item response
data banks were used for simulated administration of all short tests.
Obtained scores were then correlated with full length test scores.
Navy recruits' item responses were used throughout. Sample sizes for
item selection ranged from 1,000 to 10,000 and for simulated
administration of items, two samples of 100 were used.

Additional  recruit  samples  were  used  for  administering  short 
linear  tests  in  paper-and-pencil  form  and  short  branching  tests  via 
computer  terminal.  Correlations  with  previously  administered  full 
length  tests  were  obtained. 

Results  and  Conclusions 

1. When item response data banks were used for simulated item administration,
one approach requiring branching (Wolfe's BRANCH method)
appeared  superior  to  the  other  three  approaches  for  developing  short 
GCT  and  MECH  tests  which  paralleled  the  full  length  tests. 

2.  Results  obtained  when  short  linear  tests  were  administered  in 
paper-and-pencil  form  and  short  branching  tests  were  administered 
via  computer  terminals  were  less  clear-cut.  For  producing  results 
which  parallel  long  test  score,  one  linear  approach  (Moonan's  SEQUIN) 
appeared  as  good  as  and,  in  some  comparisons,  superior  to  the  two 
branching  approaches. 

3. It appears likely that mode of item administration (paper and
pencil  vs.  computer)  has  an  effect  on  test  score.  It  is  at  least 
possible  that  computer  terminal  testing  would  result  in  a  loss  of 
predictive  efficiency  when  tests  are  used  to  predict  external  criteria. 

4. Computer terminal administration of shortened conventional tests
does not appear, from this study, to offer a substantial improvement
over carefully designed paper-and-pencil tests of equivalent length.
It would thus appear worthwhile to direct computerized testing efforts
toward tapping those aspects of mental ability which cannot be readily
measured by conventional paper-and-pencil tests. Some measures which
may relate to such abilities include item response latencies, responses
to items when exposure time is controlled, profit from feedback, and
measures of responses to moving stimuli.

A COMPARISON OF FOUR METHODS OF SELECTING
ITEMS FOR COMPUTER ASSISTED TESTING


A.  INTRODUCTION 

1.  Problem 

There  has  been  much  interest  in  recent  years  in  the  possibility 
of  administering  shortened  tests  via  computer  terminal,  as  a  means 
of  greatly  reducing  testing  time  while  retaining  information  contained 
in  the  full  length  test.  This  poses  the  important  question  of  whether 
or  not  presently  used  full  length  paper-and-pencil  tests  can  be  short¬ 
ened  for  computer  terminal  administration  without  excessive  loss  of 
information. 

A  major  advantage  of  administering  items  on  computer  terminals  is 
that  the  computer  obviates  the  requirement  that  all  persons  be  admin¬ 
istered  the  same  items.  In  other  words,  the  items  administered  may 
be  made  contingent  upon  previous  responses.  The  extent  to  which  this 
capacity may be useful is largely unknown; hence, a related question
is  whether  or  not  this  advantage  permits  the  development  of  short 
branching  tests  which  provide  as  much  or  more  information  than  linear 
paper-and-pencil  tests  of  equivalent  length. 

Because there is no agreement on an optimal method of item selection,
it is impossible to answer these questions in an absolute sense.
It is, however, feasible to develop modes of item selection designed
to  capitalize  on  the  branching  capacity  and  to  compare  these  methods 
with  linear  strategies  for  shortening  tests. 

Most previous attempts to evaluate branching tests have been accomplished
by simulated administration of items utilizing item response
data  banks.  Although  analyses  of  this  type  are  of  interest,  they  are 
not  substitutes  for  subjecting  shortened  tests  to  actual  tryout. 

In  order  to  understand  the  reasons  underlying  the  selection  of 
methods  for  designing  shortened  tests,  it  is  necessary  to  consider 
the  parameters  involved  in  various  methods  of  item  selection,  how 
these  parameters  have  been  previously  used  to  construct  shortened 
tests,  and  how  they  may  be  expected  to  affect  results. 


2. Background


Parallelism  between  short  and  long  tests  is  a  function  of  the 
ways  in  which  known  item  parameters  are  used  in  constructing  shortened 
tests.  The  way  in  which  these  parameters,  particularly  item  difficulty 
and  item  discriminating  power,  should  be  used  to  develop  short  tests 
has been the subject of some controversy. Approaches for developing
branching  tests  have  in  the  past  relied  primarily  on  item  difficulty 
as a means of selecting items to comprise a branching paradigm. The
branching rule, stated in its simplest form, has been: If a question
is answered correctly, administer a more difficult item; if incorrectly,
administer an easier item. Using this approach, Bayroff and
Seeley  (1967)  developed  verbal  and  quantitative  tests  for  computerized 
administration.  Scores  derived  from  these  short  tests  correlated  more 
highly  with  the  respective  long  test  scores  than  the  expected  value  of 
the  correlation  of  an  equivalent  number  of  randomly  selected  linearly 
administered  items. 
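
The simplest branching rule quoted above can be sketched in a few lines. This is only an illustration under stated assumptions and is not the procedure of Bayroff and Seeley: the function name `administer_by_difficulty`, the use of proportion passing as the difficulty index, the start at the middle of the pool, and the one-position step are all hypothetical choices made for the example.

```python
def administer_by_difficulty(p_values, answer_fn, n_items=5):
    """Up-and-down branching on item difficulty (illustrative sketch).

    p_values  : dict mapping item id -> proportion passing (higher = easier)
    answer_fn : callable(item id) -> True if the item is answered correctly
    Returns the list of (item id, correct) pairs actually administered.
    """
    # Order the pool from easiest to hardest.
    ordered = sorted(p_values, key=lambda item: p_values[item], reverse=True)
    position = len(ordered) // 2        # assumed starting point: middle of the pool
    used = set()
    record = []
    while len(record) < n_items and 0 <= position < len(ordered):
        item = ordered[position]
        correct = bool(answer_fn(item))
        used.add(item)
        record.append((item, correct))
        # Correct answer -> move toward harder items; incorrect -> easier items.
        step = 1 if correct else -1
        position += step
        # Skip items this person has already taken.
        while 0 <= position < len(ordered) and ordered[position] in used:
            position += step
    return record
```

Under these assumptions a person who answers every item correctly is moved steadily toward the hardest items in the pool, while a person of middling ability alternates around the items nearest his own level.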

Lord  (1970)  and  Stocking  (1969)  have  considered  in  some  detail 
the  expected  effects  of  various  branching  strategies  on  measurement 
of  different  levels  of  the  ability  range  when  the  experimentally 
manipulated  item  parameter  is  item  difficulty  and  item  discriminating 
power  is  assumed  to  be  constant.  They  have  concluded  that,  theoreti¬ 
cally,  measurement  in  the  extremes  of  the  ability  distribution  should 
be  improved  by  utilizing  the  branching  capacity. 

If  the  pool  of  items  from  which  the  shorter  branching  test  is 
selected  is  scaleable  in  the  Guttman  sense,  i.e.,  if  passing  a  given 
item  implies  that  all  easier  items  will  likewise  be  passed,  then  the 
only  important  consideration  is  item  difficulty.  To  the  extent  that 
test  content  is  not  homogeneous  in  this  sense  (see  Dubois,  1970,  for 
implications  of  various  indices  of  homogeneity),  the  likelihood  of 
selecting  items  which  provide  maximum  criterion  discrimination  is 
diminished  by  attending  solely  to  item  difficulty. 

The  item  parameter  generally  given  primary  consideration  when 
items  are  being  selected  to  comprise  a  short  linear  test  has  been 
the discrimination index. If a total test score criterion is used,
this  entails  selecting  items  which  correlate  most  highly  with  total 
test  score.  One  risk  in  this  approach  is  that  highly  redundant  items 
might  be  selected  at  the  expense  of  items  involving  important,  but 
relatively  unique  components  of  criterion  variance  (see  Loevinger, 
1954).

As  a  remedy  for  this  problem,  Anastasi  (1968)  has  advocated 
selecting items for a linear test according to "net effectiveness,"
i.e.,  their  unique  contribution  to  the  prediction  of  total  test  score 
or  some  external  criterion.  She  comments,  however,  that  approaches 
of  this  type  may  be  criticized  on  the  basis  of  expected  unreliability 
of  partial  regression  weights  when  applied  to  single  items.  One  net 
effectiveness  approach  developed  by  Moonan  and  Pooch  (1966)  partially 
circumvents the unreliability problem by selecting items in order of
their  contribution  to  a  multiple  R,  then  unit  weighting  each  selected 
item. 

Item  analysis  methods  which  select  items  with  high  discriminating 
power  as  well  as  those  which  select  items  showing  discriminating  power 
in a net effectiveness sense have been developed primarily for linear
tests, but also may be applied to branching tests. Lord, Novick, and
Birnbaum (1968) have considered the joint effect of discriminating
power and item difficulty and their relationship to ability. Linn,
Rock and Cleary (1969; see also Cleary, Linn and Rock, 1968), have
attempted, with varying degrees of success, to incorporate discriminating
power into item selection strategies when devising branching
tests; however, their index of discriminating power was based on the
total  group  item-test  point  biserial  rather  than  on  discriminating 
power  for  the  group  to  whom  the  item  would  be  administered. 

Two  methods  of  selecting  items  which  should  theoretically  discrim¬ 
inate  very  accurately  among  the  persons  to  whom  they  are  administered 
are  outlined  below: 

a. Wright and Panchapakesan Parameters (WRIPA). Wright and
Panchapakesan (1969) have designed a program to obtain item difficulty
and item discriminating power estimates based on item characteristic
curves. The item difficulty (log easiness) estimate of an item is
related to more conventionally obtained item difficulty estimates,
based on the percentage passing, but tends to be stable across samples
of  varying  ability.  The  item  discriminating  power  estimate,  however, 
refers  to  the  discriminating  power  among  persons  whose  ability  level 
is  such  that  half  of  them  may  be  expected  to  pass  the  item  and  half  to 
fail  it.  While  no  one  has  derived  an  optimal  way  of  combining  these 
parameters  for  use  in  developing  branching  tests,  it  should  be  possible 
to  avoid  complete  reliance  on  item  difficulty  estimates  by  selecting 
within  each  difficulty  level  the  item  which  shows  the  greatest  discrim¬ 
inating  power. 

b.  BRANCH  Approach.  A  strategy  and  program  for  selecting  items 
(when  branching  is  permitted)  that  should  maximize  the  prediction  of 
total score was devised by Wolfe (1970). The program operates as
follows:  Point  biserial  correlations  of  all  items  with  total  test 
score  are  calculated.  The  most  discriminating  item  for  the  total  group 
is  selected  and  the  group  is  partitioned  into  those  who  pass  the  item 
and  those  who  fail  the  item.  Correlations  of  all  remaining  items  with 
total  score  are  then  calculated  for  each  of  the  two  new  groups.  The 
most discriminating item for each group is then selected and the groups
are  split  into  those  who  pass  and  those  who  fail  the  second  item-- 
producing  four  groups.  The  second  item  to  be  selected  may  or  may  not 
be  the  same  for  the  two  groups.  This  process  is  continued  until  a 
specified  number  of  items  is  selected  or  until  the  item  selected  fails 
to  make  a  significant  discrimination  for  the  group  for  which  it  was 
selected. The maximum number of groups produced will be 2^n where n
equals  the  number  of  items  to  be  administered  to  each  person.  (This 
is,  of  course,  true  only  where  n  is  a  constant.) 
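
The partitioning procedure just described can be sketched as a recursive routine. The sketch below is an illustration and not Wolfe's program: the names `point_biserial` and `build_branch_tree` are hypothetical, the significance test used as a stopping rule is replaced by a simple minimum-correlation threshold (`min_r`, an assumption), and each terminal group is summarized by its size and mean total score.

```python
from statistics import mean

def point_biserial(item_scores, totals):
    """Pearson correlation between a 0/1 item and total test score."""
    mx, my = mean(item_scores), mean(totals)
    sxx = sum((x - mx) ** 2 for x in item_scores)
    syy = sum((y - my) ** 2 for y in totals)
    if sxx == 0 or syy == 0:            # no variance within this subgroup
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(item_scores, totals))
    return sxy / (sxx * syy) ** 0.5

def build_branch_tree(people, depth, used=(), min_r=0.0):
    """people: list of (responses, total) pairs; responses maps item -> 0/1.

    Internal nodes hold the selected item and its 'pass'/'fail' subtrees;
    terminal nodes hold the subgroup size and its mean total score.
    """
    totals = [total for _, total in people]
    leaf = {"n": len(people), "mean_total": mean(totals) if totals else None}
    if depth == 0 or len(people) < 2:
        return leaf
    candidates = [item for item in people[0][0] if item not in used]
    if not candidates:
        return leaf
    # Most discriminating remaining item for THIS subgroup.
    best_r, best = max(
        (point_biserial([resp[item] for resp, _ in people], totals), item)
        for item in candidates
    )
    if best_r <= min_r:                 # assumed stand-in for the significance test
        return leaf
    passed = [p for p in people if p[0][best] == 1]
    failed = [p for p in people if p[0][best] == 0]
    taken = tuple(used) + (best,)
    return {
        "item": best,
        "pass": build_branch_tree(passed, depth - 1, taken, min_r),
        "fail": build_branch_tree(failed, depth - 1, taken, min_r),
    }
```

With `depth` fixed at n and no early stopping, the routine yields at most 2^n terminal groups, which agrees with the count given in the text.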

The  WRIPA  approach  is  analogous  to  selecting  items  which  show 
maximal discriminating power for a linear test, the major difference
being that there is some assurance that items selected have discriminating
power for the particular subgroups to which they will be
administered. The BRANCH approach is a net effectiveness approach
in that maximally discriminating items are selected for subgroups
which are homogeneous with respect to previously administered items.
Interesting comparisons can be made between BRANCH and other item
selection procedures. For example, it can be demonstrated that if
items were perfectly Guttman scaleable, BRANCH would select items
only on the basis of item difficulty. Or, if the same items were
selected by BRANCH for all groups, then there is evidence that a
linear test would suffice. (Comparisons between configural scoring
and summed scoring would of course be necessary to determine whether
or not the particular items missed make any difference.)

In  general,  it  would  appear  that  BRANCH  should  be  an  excellent 
means  of  item  selection,  irrespective  of  the  nature  of  the  total  test 
from  which  items  are  selected.  If  the  test  is  highly  homogeneous  but 
improved measurement is effected by administering items whose difficulties
are reasonably compatible with Ss' ability, then the WRIPA
scaling  parameters  should  provide  useful  means  of  selecting  items. 

3.  Approach 

The  present  research  compared  the  BRANCH  and  WRIPA  modes  of  item 
selection,  using  a  large  pool  of  previously  collected  item  response 
data for item selection and cross validation. The resulting short
tests  were  then  administered  on  computer  terminals.  To  compare  infor¬ 
mation  provided  by  branching  tests  with  that  provided  when  short  easily 
administered  paper-and-pencil  tests  were  used,  two  types  of  linear 
tests  were  devised.  The  first  linear  test  included  items  showing  the 
highest  correlation  with  total  test  score  (the  conventional  way  to 
shorten  tests)  and  is  referred  to  as  the  high  validity  (HI  VAL)  ap¬ 
proach.  Items  included  in  the  second  short  linear  test  were  selected 
by  SEQUIN  to  provide  a  linear  net  effectiveness  comparison.  These 
short  linear  tests  were  administered  in  paper-and-pencil  version. 

In order to compare all four approaches across tests having somewhat
different characteristics, the General Classification Test (GCT)
and the Mechanical Aptitude Test (MECH) were used. GCT is a verbal
test with extremely high internal consistency (KR20 = .975). MECH
contains items of two types, tool knowledge and mechanical reasoning.
(The mechanical reasoning items are similar to those found on the
Bennett Test of Mechanical Comprehension.) As might be expected, MECH
is less homogeneous (KR20 = .928). Both tests show a fairly wide range
of item difficulties (GCT .93 - .20, and MECH .97 - .08).
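
For readers unfamiliar with the internal consistency index cited above, the standard Kuder-Richardson formula 20 can be computed as sketched below. The function name `kr20` and the input layout are assumptions made for the example, and the variance of total scores is taken with the N (population) denominator, one common convention; the report does not state how its values were computed.

```python
from statistics import pvariance

def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomously scored items.

    responses: list of per-person lists of 0/1 item scores, all the same
    length. Total scores must show some variance.
    """
    n_people = len(responses)
    k = len(responses[0])
    totals = [sum(person) for person in responses]
    # Proportion passing each item.
    p = [sum(person[i] for person in responses) / n_people for i in range(k)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    return (k / (k - 1)) * (1 - sum_pq / pvariance(totals))
```

Values near 1.0, such as those reported for GCT and MECH, indicate that the items rank persons in largely the same way.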

In  general,  the  four  ways  of  deriving  the  short  tests  may  be  de¬ 
picted  as  follows: 

Each of the four methods was used to construct two short tests,
one of items selected from GCT and the other of items from MECH. Simulated
item administration of the eight short tests (generally five
items per person) was accomplished using item response data banks.
Tests requiring branching were then administered on computer terminals
while the short linear tests were administered by the usual paper-and-pencil
methods. Basically, then, the experiment included two phases,
the first phase involving item selection and cross validation on large
item data banks, and the second phase involving an experimental tryout
of the shortened tests via traditional methods or computer.

4.  Hypotheses 

a.  Major  Hypothesis.  That  extremely  short  tests  (5-6  items)  can 
be  developed  and  administered  via  computer  terminal  with  little  loss 
of  the  information  contained  in  the  total  (100  item)  test. 

b.  Subsidiary  Hypotheses 

(1)  That  BRANCH  is  the  best  means  of  selecting  items  for  a 
shortened  test.  (This  was  anticipated  because  BRANCH  minimizes  redun¬ 
dancy  and  assures  that  each  item  is  maximally  discriminating  for  the 
group  to  whom  it  is  administered.) 

(2)  That  WRIPA  is  the  second  best  item  selection  technique  for 
constructing a short GCT test, but third best for constructing a short
MECH  test.  (The  fact  that  WRIPA  allows  the  administration  of  items 
varying  in  difficulty  level  but  not  necessarily  contributing  to  the 
prediction  of  all  components  of  criterion  variance  suggests  that  it 
should  provide  a  useful  item  selection  technique  for  constructing  a 
short  GCT  test,  but  because  of  the  somewhat  greater  heterogeneity  of 
MECH,  important  information  would  be  lost  by  the  use  of  the  WRIPA 
procedure.) 

(3)  That  SEQUIN  is  the  second  best  item  selection  technique 
for constructing a short MECH test, but third best for constructing
a short GCT test. (SEQUIN allows for representation of unique
components of criterion variance which should be useful with MECH,
but with GCT not so necessary as allowing persons to take items compatible
with their ability level.)

(4)  That  the  traditional  HI  VAL  approach  is  the  poorest  method 
of selecting items from both GCT and MECH. (The HI VAL approach maximizes
neither discrimination at the appropriate ability level nor
representation  of  unique  components  of  criterion  variance.) 


B.  METHODS 

1. Recruit Samples. All samples were composed of men who were in recruit
training at the Naval Training Center (NTC), San Diego between
January  1968  and  May  1971.  Specific  samples  used  were  as  follows: 

a. Item responses to GCT and MECH were obtained for a sample of
10,000 recruits (tested between January and June 1968) and used to
select items according to three of the methods (WRIPA, BRANCH, and HI
VAL) used in the present study.

b.  The  GCT  and  MECH  item  data  from  1000  Navy  recruits  used  by 
Swanson  (1968)  for  SEQUIN  analyses  were  also  used  in  the  present  study 
to  construct  the  short  SEQUIN  test. 

c. Two samples of 100 recruits were used for cross validation
(simulated  item  administration).  That  is,  although  complete  item 
response  data  were  available  for  each  of  these  groups,  the  data  were 
used  to  obtain  scores  on  each  of  the  short  tests. 

d.  Two  samples  of  250  recruits  in  their  third  week  of  recruit 
training  were  administered  short  linear  versions  of  GCT  and  MECH,  one 
sample  receiving  items  selected  by  SEQUIN  and  the  other  items  selected 
according  to  the  HI  VAL  approach. 

e.  A  total  of  526  recruits  between  their  third  and  fifth  week  of 
recruit  training  were  administered  items  on  computer  terminals  located 
at NTC San Diego.* The Ss were randomly split into two groups (263 Ss
in each group), one of which received WRIPA versions of GCT and MECH
and  the  other,  BRANCH  versions  of  the  same  two  tests. 

2. Apparatus. BRANCH and WRIPA tests were administered on an IBM
1500  computer  assisted  instruction  system  superimposed  on  an  IBM  1130 
central  processing  unit.  Thirteen  individual  test  stations  were  used. 


* This was accomplished with considerable support from Dr. John
Ford and his staff at the Computer Assisted Instruction Laboratory of
the Naval Personnel and Training Research Laboratory. This assistance
is gratefully acknowledged.

Specific pieces of equipment used for administration of items and
recording  of  responses  included  a  1510  cathode  ray  tube  display  unit 
with  light  pen  and  keyboard  (located  at  each  of  the  13  test  stations, 
and  used  to  display  items),  a  2310  disc  unit  (for  reading  items  onto 
the  cathode  ray  tube)  and  a  2415  tape  unit  (for  recording  response 
data). A 1518 typewriter was located at the proctor station to indicate
when subjects began and completed the tests as well as to
indicate  any  malfunctions  during  the  testing. 

3.  Procedure 


a.  Initial  Item  Selection  and  Scoring.  Shortened  versions  of 
GCT  and  MECH  were  constructed  according  to  the  following  four  methods: 

(1)  HI  VAL.  Item  validities  for  predicting  total  test  score 
were  obtained  for  each  of  the  100  items  in  GCT  and  the  100  items  in 
MECH.  The  entire  sample  of  10,000  recruits  was  used  for  item  analysis 
and selection. In each case, the five items showing the highest point
biserial correlation with total test score (irrespective of item difficulty)
were selected to comprise a short test. Hence, a five-item
GCT  test  and  a  five-item  MECH  test  were  constructed.  Scoring  was  ac¬ 
complished  by  simply  summing  the  correct  responses. 
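
The HI VAL rule amounts to ranking items by their point biserial correlation with total score and keeping the top few. A minimal sketch follows; the names `select_hi_val` and `number_right` and the input layout are assumptions made for the example, and total score is taken here as the sum over all items supplied.

```python
from statistics import mean

def point_biserial(item_scores, totals):
    """Pearson correlation between a 0/1 item and total test score."""
    mx, my = mean(item_scores), mean(totals)
    sxx = sum((x - mx) ** 2 for x in item_scores)
    syy = sum((y - my) ** 2 for y in totals)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(item_scores, totals))
    return sxy / (sxx * syy) ** 0.5

def select_hi_val(item_matrix, n_select=5):
    """item_matrix: dict item -> list of 0/1 scores, one entry per person.

    Returns the n_select items with the highest item-total correlation,
    irrespective of item difficulty.
    """
    n_people = len(next(iter(item_matrix.values())))
    totals = [sum(scores[j] for scores in item_matrix.values())
              for j in range(n_people)]
    ranked = sorted(item_matrix,
                    key=lambda item: point_biserial(item_matrix[item], totals),
                    reverse=True)
    return ranked[:n_select]

def number_right(person_responses, selected):
    """Short-test score: simple count of correct responses."""
    return sum(person_responses[item] for item in selected)
```

Because each item is judged on its own correlation with total score, nothing in this rule prevents the selected items from being highly redundant with one another.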

(2)  SEQUIN.  Five-item  GCT  and  MECH  tests  were  also  constructed 
according  to  the  SEQUIN  procedure.  Previous  SEQUIN  analyses  (Swanson, 
1968)  based  on  samples  of  1000  recruits  were  used  for  selecting  items. 
This  approach  involves  selecting  in  sequence  a  series  of  items  each  of 
which  would  contribute  maximally  to  the  multiple  R,  but  unit  weighting 
each selected item to obtain a score which is used in computing the
shortened  tests’  correlation  with  the  total  test  score. 
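
The sequential idea can be illustrated with an ordinary forward selection on the squared multiple correlation. The sketch below is not the SEQUIN program itself: the function names are hypothetical, the criterion is any list of scores supplied by the caller, and the small linear solver assumes that no candidate item is constant or perfectly collinear with items already chosen.

```python
from statistics import mean

def _centered(values):
    m = mean(values)
    return [v - m for v in values]

def _dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def _solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(columns, criterion):
    """Squared multiple correlation of the criterion on the given items."""
    xs = [_centered(c) for c in columns]
    yc = _centered(criterion)
    gram = [[_dot(u, v) for v in xs] for u in xs]
    rhs = [_dot(u, yc) for u in xs]
    beta = _solve(gram, rhs)
    return _dot(beta, rhs) / _dot(yc, yc)

def forward_select(items, criterion, n_select):
    """At each step add the item giving the largest multiple R so far."""
    selected = []
    while len(selected) < n_select:
        remaining = [i for i in items if i not in selected]
        best = max(remaining, key=lambda i: r_squared(
            [items[j] for j in selected + [i]], criterion))
        selected.append(best)
    return selected

def unit_weighted_score(person_responses, selected):
    """Regression weights are discarded; each selected item counts one point."""
    return sum(person_responses[i] for i in selected)
```

In contrast with a simple ranking on item validity, an item that largely duplicates one already chosen adds little to the multiple R and is passed over in favor of an item carrying a more unique component of criterion variance.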

(3)  WRIPA.  In  order  to  obtain  estimates  of  item  difficulty 
and  of  the  discriminating  power  of  items  at  various  ability  levels,  a 
program written by Wright and Panchapakesan (1968) was used. The
specific  parameters  obtained  were  log  easiness  and  the  slope  of  the 
item  characteristic  curve  at  the  median  response  (i.e.,  the  point  at 
which  50  percent  of  the  people  pass  the  item  and  50  percent  fail  the 
item).  Items  were  selected  from  GCT  and  MECH  to  construct  approx¬ 
imately  symmetric  distributions  of  log  easiness  estimates  with 
approximately  equal  intervals  between  final  difficulty  levels.  Each 
selected  item  was  the  one  which  showed  the  largest  slope  within  this 
context.  Because  the  resulting  paradigms  for  both  GCT  and  MECH  con¬ 
tained  items  which  were  rather  consistently  easier  than  desired,  one 
additional  difficult  item  was  selected  for  persons  who  answered  all 
five  items  correctly.  The  entire  branching  paradigm  for  GCT  is 
presented  in  Figure  1  and  for  MECH  in  Figure  2.  To  obtain  an  estimate 
of  final  score  for  each  terminal  point,  the  sample  of  10,000  recruits 
was  used.  The  sample  was  successively  sorted  into  groups  passing  and 
failing  each  of  the  indicated  items.  Mean  total  test  score  for  each 
terminal point was then obtained for all persons in the sample of
10,000. These means were subsequently used as "scores" for persons
terminating  at  the  various  points  (see  Figures  1  and  2). 
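
Two steps of the WRIPA procedure lend themselves to a brief sketch: choosing, within each difficulty level, the item with the largest slope, and scoring each terminal point by the mean total score of the persons who reach it. The code below is illustrative only. It does not reproduce the Wright and Panchapakesan program; the function names, the band cut points, and the parameter values in the usage note are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def select_by_band(item_params, band_edges):
    """item_params: dict item -> (log_easiness, slope).
    band_edges: ascending cut points; consecutive pairs define difficulty bands.

    Returns, for each non-empty band, the item with the largest slope.
    """
    chosen = []
    for low, high in zip(band_edges, band_edges[1:]):
        in_band = [item for item, (easiness, _) in item_params.items()
                   if low <= easiness < high]
        if in_band:
            chosen.append(max(in_band, key=lambda item: item_params[item][1]))
    return chosen

def terminal_means(paths, totals):
    """paths: one hashable pass/fail path per person (e.g. a tuple);
    totals: the same persons' full length test scores.

    Returns dict path -> mean total score, used as the 'score' for anyone
    terminating at that point.
    """
    groups = defaultdict(list)
    for path, total in zip(paths, totals):
        groups[path].append(total)
    return {path: mean(values) for path, values in groups.items()}
```

For example, with hypothetical parameters `{"q1": (1.2, 0.8), "q2": (1.0, 1.1), "q3": (0.1, 0.9)}` and cut points `[-0.5, 0.5, 1.5]`, the routine keeps `q3` for the middle band and `q2` rather than `q1` for the easy band, since `q2` has the larger slope.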

(4)  BRANCH.  Program  BRANCH  (Wolfe,  1970)  was  used  to  select 
items.  The  general  procedure  has  been  described  previously.  For  the 
present  study,  the  program  was  used  as  follows:  Validities  (point 
biserial  correlations  with  total  score)  were  obtained  for  all  items 
using  the  entire  sample  of  10,000  Ss.  The  most  valid  item  was  used 
to sort the sample into those who passed and those who failed. Validities
were recomputed for each of the two groups. The most valid
item for each of these groups was then chosen and groups were again
sorted, producing four groups. This process was continued until five
items had been chosen for each person, producing 2^5 or 32 groups of
persons. Mean GCT scores were obtained for each of these groups.
(Sample sizes for groups ranged from 41 to 1813.) These means were
used as expected values of GCT score for each terminal point. (In
BRANCH, each terminal point represents a unique pathway through the
items.) Items were then selected from MECH in the same manner. The
resulting branching paradigm for GCT is presented in Figure 3 and for
MECH in Figure 4.

b.  Tryout  Using  Item  Response  Data  Bank.  Item  response  data  to 
full  length  GCT  and  MECH  tests  for  two  samples  of  persons  (N=100  for 
each  test)  were  used  to  simulate  branching.  Items  selected  by  each 
of  the  four  methods  from  each  of  the  two  tests  were  "administered"  to 
persons  in  the  two  samples.  Scores  were  determined  and  the  resulting 
short  test  scores  were  correlated  with  total  length  test  scores  to 
permit  comparisons  of  efficiency  in  replicating  total  test  score. 
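
Simulated administration of a branching test from a data bank can be sketched as follows. Each person's stored responses to the full length test are consulted only for the items his pathway calls for, and the terminal score is then correlated with his full length score. The two-item paradigm, its terminal scores, and the eight-person data bank below are hypothetical values chosen for illustration; they are not taken from the study.

```python
from statistics import mean

# Hypothetical paradigm: terminal "score" values stand for subgroup mean
# total scores obtained beforehand on an item selection sample.
PARADIGM = {
    "item": "a",
    "pass": {"item": "b", "pass": {"score": 85}, "fail": {"score": 65}},
    "fail": {"item": "c", "pass": {"score": 40}, "fail": {"score": 30}},
}

def simulate(paradigm, responses):
    """Route one person's stored 0/1 responses through the paradigm."""
    node = paradigm
    while "item" in node:
        node = node["pass"] if responses[node["item"]] == 1 else node["fail"]
    return node["score"]

def pearson(x, y):
    """Product-moment correlation between two lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data bank: (stored item responses, full length test score).
DATA_BANK = [
    ({"a": 1, "b": 1, "c": 1}, 90), ({"a": 1, "b": 1, "c": 0}, 80),
    ({"a": 1, "b": 0, "c": 1}, 70), ({"a": 1, "b": 0, "c": 0}, 60),
    ({"a": 0, "b": 1, "c": 1}, 50), ({"a": 0, "b": 1, "c": 0}, 40),
    ({"a": 0, "b": 0, "c": 1}, 30), ({"a": 0, "b": 0, "c": 0}, 20),
]

short_scores = [simulate(PARADIGM, responses) for responses, _ in DATA_BANK]
full_scores = [total for _, total in DATA_BANK]
short_long_r = pearson(short_scores, full_scores)
```

A linear short test is handled the same way, except that every person "takes" the same items and the score is the number right.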

c.  Administration  of  Shortened  Tests 

(1) Paper-and-Pencil Testing. Five-item SEQUIN and HI VAL
versions of GCT and MECH were administered at NTC San Diego. These
short tests were presented in specially printed test booklets. (Additional
items were subsequently presented but are not relevant to the
present study.) The two SEQUIN tests were administered to one sample
of 250 recruits and the HI VAL tests, to another sample of the same
size. Four minutes were allowed for administration of each test.
Test scores were obtained by simply summing the number of correct responses.
These scores were then correlated with total test score. (The full
length tests had been administered three weeks previously during routine
classification testing.)

(2)  Computer  Testing.  WRIPA  and  BRANCH  tests  were  programmed 
for computer terminal administration. All instructions and sample
items  (the  same  sample  items  as  given  with  SEQUIN  and  HI  VAL  tests) 
were  administered  via  the  terminals.  WRIPA  versions  of  GCT  and  MECH 
were  administered  to  a  sample  of  263  persons,  and  BRANCH  versions  of 
the same tests to another sample of 263. Responses were made by
subjects' touching the spot beside the correct response with a light
pen. Responses, response latencies, item scores and expected value
of total test score were recorded for each subject. If the subject
had not responded after spending 45 seconds on an item, a time
warning was given. Ten more seconds were then allowed, and if a
response had not been made by then, the response was considered
incorrect and the next indicated item was administered.
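
The item timing rule just described can be sketched as follows. The 45-second warning and the 10-second grace period come from the text; the function name, the argument layout, and the treatment of a response arriving exactly at the 55-second limit (counted here as too late) are assumptions made for illustration.

```python
WARNING_AFTER_S = 45.0   # time warning is given after this many seconds
GRACE_S = 10.0           # additional seconds allowed after the warning

def score_timed_response(latency_s, chosen, correct):
    """Return (item_score, warned) for one item.

    latency_s is the response latency in seconds, or None if the
    subject never responded. An item with no response inside the
    limit is scored incorrect (0), as in the procedure described.
    """
    deadline = WARNING_AFTER_S + GRACE_S
    if latency_s is None or latency_s >= deadline:
        # Assumption: a response at or past the deadline counts as no response.
        return 0, True
    warned = latency_s > WARNING_AFTER_S
    return (1 if chosen == correct else 0), warned
```

For example, a correct answer given after 50 seconds is still scored 1 but is flagged as having followed the warning, while no answer at all yields a score of 0.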

Groups of 13 recruits were brought into the Computer
Assisted Instruction Laboratory at 30-minute intervals. The actual
amount of time spent on the terminal ranged between 7 and 20 minutes.
Scores on full length GCT and MECH tests (administered 2-4 weeks
previously) were obtained for all subjects and merged with the short
test data. Expected values of total test scores obtained from
administration of the short branching tests were correlated with actual
total test scores.


C.  RESULTS  AND  DISCUSSION 

The major hypothesis, that extremely short tests (5-6 items) can
be developed and administered via computer terminal with little loss
of information contained in the total (100 item) test, appeared to
be supported when short GCT tests were developed and item
administration was simulated. Perhaps because of the greater
heterogeneity of MECH, 5-6 item MECH tests were not as good as short
GCT tests. When the short branching tests were administered via
computer and the short linear tests administered in paper-and-pencil
form, the branching tests offered no advantage over the linear tests.
Furthermore, all short-test approaches resulted in greater information
loss than was incurred with the simulated runs.

1.  Simulated  Administration 


Obtained correlations of short test scores with long test scores
(GCT and MECH) are listed in Table 1 for the two samples of 100
recruits. Base values against which to compare these simulated runs,
namely the expected correlation with the long test of a random set of
five items taken from it, were established by using the Spearman-Brown
formula. These were .66 for GCT and .40 for MECH. For both tests,
all approaches represent substantial improvement over these base
values.
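
For readers unfamiliar with it, the general Spearman-Brown formula can be sketched as below. The report does not state which reliability estimates were entered into the formula or how the base values of .66 and .40 were derived from it, so this block shows only the standard formula; the function name and the reliability of .90 used in the example are hypothetical and are not taken from the study.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability of a test whose length is changed by length_factor.

    reliability   : reliability of the original test
    length_factor : new length divided by old length (e.g. 5 / 100)
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    denominator = 1 + (length_factor - 1) * reliability
    if denominator <= 0:
        raise ValueError("formula is undefined for these inputs")
    return length_factor * reliability / denominator

# Hypothetical illustration: a 100-item test with reliability .90
# cut to 5 randomly chosen items.
predicted = spearman_brown(0.90, 5 / 100)
```

The formula makes clear why a random five-item subset is a weak baseline: shortening by a factor of twenty sharply reduces predicted reliability, so item selection has to recover most of the lost information.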

The  most  consistent  finding  was  that  irrespective  of  method  used 
for  selecting  items,  correlations  between  short  and  long  GCT  tests 
are  much  higher  than  those  between  short  and  long  MECH  tests.  It 
appears  from  these  results  that  GCT  could  be  substantially  shortened 
to  five  or  six  item  length  without  appreciable  loss  of  information; 
but that MECH tests this short are not very satisfactory.

TABLE 1

Cross-Validated Correlations of Simulated Short Test Scores
With Total Test Scores, Using Four Item Selection
Procedures and Two Tests

Item Selection          GCT                    MECH
Procedures       Sample 1  Sample 2     Sample 1  Sample 2

BRANCH(a,b)        .95       .92          .83       .80
WRIPA(b)           .90       .89          .69       .70
HI VAL(c)          .87       .86          .67       .69
SEQUIN(a)          .94       .89          .73       .70

N                  100       100          100       100

(a) Methods which reduce redundancy (i.e., attend to item
covariances).

(b) Methods which discriminate at appropriate ability level
(hence require administering different items to different
persons).

(c) Method which relies solely on item-test point biserial
correlation.


Of  the  four  methods  evaluated  the  HI  VAL  approach  produced  the 
poorest  results  on  both  GCT  and  MECH.  Using  HI  VAL,  the  simulated 
short  GCT  test  scores  correlated  .87  and  .86  with  total  score  and 
the  short  MECH  test,  .67  and  .69  with  total  score. 

Other comparisons among methods are less conclusive. It was
expected that SEQUIN and WRIPA would be better item selection
procedures than HI VAL and poorer than BRANCH. With respect to
comparisons between SEQUIN and WRIPA, it was hypothesized that because
GCT is an extremely homogeneous test and MECH is relatively
heterogeneous, WRIPA would be a better item selection procedure for
constructing a short GCT test and SEQUIN for constructing a short MECH
test. The actual findings suggest that SEQUIN is a slightly better
procedure for constructing a shortened test even when test content is
extremely homogeneous. That is, for both GCT and MECH, short SEQUIN
tests were slightly better than the respective short WRIPA tests.

Because program BRANCH successively selected maximally
discriminating items for groups defined by patterns of previous item
responses (i.e., utilized all available information in item selection),
it was hypothesized that the short tests constructed by BRANCH would
be superior to those constructed by any other method. For these
simulated item administration cross-validations this was indeed the
case. The correlations between BRANCH score and total score on GCT
for the two cross-validation samples were .95 and .92 and, on MECH,
.83 and .80.
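
The administration of a branching test of this kind can be sketched as a walk down a binary tree, in which each response determines the next item and each terminal response pattern carries an expected value of total test score. This is an assumed data layout for illustration only, not the BRANCH program itself; the class names, item identifiers and expected values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

@dataclass
class Leaf:
    expected_total: float          # expected total score for this response pattern

@dataclass
class Node:
    item_id: str                   # item administered at this point
    if_wrong: Union["Node", Leaf]  # branch taken after an incorrect response
    if_right: Union["Node", Leaf]  # branch taken after a correct response

def administer(tree: Union[Node, Leaf],
               answer_is_correct: Callable[[str], bool]) -> Tuple[float, List[Tuple[str, bool]]]:
    """Follow the tree, asking one item per level, and return the
    expected total score together with the (item, correct) path taken."""
    node = tree
    path = []
    while isinstance(node, Node):
        right = bool(answer_is_correct(node.item_id))
        path.append((node.item_id, right))
        node = node.if_right if right else node.if_wrong
    return node.expected_total, path

# Hypothetical two-item tree; a five-item test would simply be deeper.
toy_tree = Node(
    "item_40",
    if_wrong=Node("item_12", Leaf(20.0), Leaf(38.0)),
    if_right=Node("item_77", Leaf(55.0), Leaf(81.0)),
)
# Simulated examinee who answers everything correctly except item_77.
score, path = administer(toy_tree, lambda item: item != "item_77")
```

The same routine serves for simulated administration: the callback simply looks up the examinee's recorded response to the requested item in the response data bank instead of presenting it on a terminal.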

These simulated short test results suggest that if tests are to
be shortened, and if very large samples are available for selecting
items and computer terminals are available for administering them, the
BRANCH program should provide an excellent means of constructing
tests to parallel the longer form.

2.  Administration  of  Shortened  Tests 

Unfortunately, ambiguous results were obtained when the shortened
tests were actually administered as such. The linear shortened tests
were administered in paper-and-pencil form and branching tests were
administered by computer terminal.

Correlations of short test scores with scores obtained on the
total test (administered 2-4 weeks previously) are listed in Table 2.
While SEQUIN was still a better item selection procedure than HI VAL
and BRANCH better than WRIPA, previously demonstrated differences
between BRANCH and SEQUIN were not maintained. In fact, results
obtained with a five-item SEQUIN test were, for MECH, slightly better
than those obtained using a five-item BRANCH test (r = .74 as opposed
to r = .73). For GCT, the results were equivalent (r = .83).
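
How close the MECH correlations of .74 (SEQUIN, N = 250) and .73 (BRANCH, N = 263) are can be illustrated with a Fisher z comparison of two independent correlations. The report itself performs no significance test; this standard technique is brought in here purely as an illustration, on the assumption that the two samples are independent (the two tests were given to different groups of recruits). The function name is hypothetical.

```python
from math import atanh, sqrt

def fisher_z_difference(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations,
    using Fisher's r-to-z transformation."""
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each sample must contain more than 3 cases")
    standard_error = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (atanh(r1) - atanh(r2)) / standard_error

# MECH: five-item SEQUIN test versus five-item BRANCH test.
z_mech = fisher_z_difference(0.74, 250, 0.73, 263)
```

The resulting statistic falls far short of the conventional 1.96 criterion, which is consistent with treating the two procedures as equivalent in the administered form rather than reading the .01 difference as an advantage for SEQUIN.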

These comparisons are critical, for the process of adapting tests
for computer administration is a very expensive one which requires
that definite advantages of this mode of administration be demonstrated.
To the contrary, the present results suggest that as much information
may be derived from a short linear paper-and-pencil test as from the
more complex short branching test.

The fact that expected results were obtained with the simulated
runs, but not with the on-line runs, may possibly be due to either or
both of the following factors:

a. Because of its use of successive sample splits for determining
the sequential items to administer, BRANCH may capitalize on error to
a much greater extent than SEQUIN. However, if this were the case,
the discrepancy would also be expected to be apparent in the simulated
cross-validation runs. Furthermore, it should be recalled that very
large samples were used to select items (10,000 recruits for the
BRANCH procedure and 1000 for the SEQUIN procedure), thus reducing the
likelihood of capitalization on chance.

TABLE 2

Correlation of Short Tests
With Total Test Score

Item Selection
Procedures        GCT      MECH       N

BRANCH(a)         .83      .73       263
WRIPA(a)          .79      .72       263
SEQUIN(b)         .83      .74       250
HI VAL(b)         .80      .73       250

(a) Administered on computer terminal 2-4 weeks after
administration of total test.

(b) Administered in paper-and-pencil version three weeks
subsequent to administration of total test.


b. A more plausible explanation is that the BRANCH correlations
were substantially lowered because of the switch in mode of item
administration. This possibility is supported by the fact that when
WRIPA tests were administered via computer terminal much poorer
results were obtained than had been obtained by the simulated runs.

While the Ss appeared extremely interested in taking the tests on
computer terminals and there were no complaints about clarity of
instructions, etc., the procedures represented a marked deviation
from standard testing conditions. In addition to the novelty of the
equipment used to project and record responses, test content was
rather subtly altered. Each item was timed separately (very few
items were unanswered) and no provision was made for returning to
previously answered items. It had not been anticipated that these
features of computer assisted testing would make very much difference.

Several items which appear toward the end of the long tests (both
GCT and MECH) had been selected for BRANCH. These may have been items
which, more than anything else, discriminated between those who
completed the long test and those who did not. With items timed
separately for computer administration, all persons were exposed to
all items. Differences of this sort could have served to influence
results substantially.

Taken as a whole, the present study indicates that credence cannot
be placed in results obtained from simulated item administration
strategies if the purpose is to eventually produce tests to be
administered on computer terminals. While lower correlations with total
score were obtained with actual short linear tests than with the
simulated linear tests, some decrement was to be expected because of
the time span between the two administrations of the items and
because of the changed context in which the items were presented.
However, the finding that on-line administered branching tests are
no better than their short linear paper-and-pencil administered
counterparts suggests a large effect at least partially attributable
to mode of administration.


D.  RECOMMENDATIONS  FOR  FUTURE  RESEARCH 

The major problem investigated in the present study (namely, can
long tests be shortened for computer terminal administration without
excessive loss of information) was not conclusively answered. In the
simulated item administration runs it appeared that one strategy
(BRANCH) requiring administration of different items to different
persons was most effective. The results obtained when these items
were administered by computer terminal were not nearly so compelling.

It should be determined whether the problem is due to the changed
mode of item administration, for there appears to be little reason to
discredit the methods of item selection. To this end, it would be
desirable to administer the total test via computer terminal and
correlate the scores so obtained with scores obtained by traditional
paper-and-pencil administration. This correlation value should then
be compared with the paper-and-pencil test-retest reliability (with
the same time span between administrations). To the extent that the
latter correlation is higher than the former one, there is evidence
for an effect due to changed mode of administration.

If failure to replicate total test score does stem from differences
in mode of administration, it should be determined whether or not this
loss results in loss of predictive power. If predictive power is lost,
then it is perhaps necessary to revise conceptions of the role of
computerized testing. When the goal is to shorten paper-and-pencil
tests, it appears that they can be adequately shortened by using the
SEQUIN approach, which does not require computerized testing. In
developing tests for administration by computer, it would seem wise to
take advantage not only of its branching capabilities but also of its
capacity to standardize item presentation time, record response
latencies, present moving visual stimuli, provide feedback, etc.
Measures thus obtained may be related to previously untapped dimensions
of on-job performance. In short, the promise of computerized testing
may best be realized by developing methods which supplement rather than
replace paper-and-pencil tests.